from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
In this notebook, I perform data analysis using to reach answers to some of the questions explained in first notebook -
Here are my initial hypothesis -
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandasql import sqldf
import os
import matplotlib as mpl
mpl.rc('lines', linewidth=2, color='r')
mpl.rc('font', size=16)
mpl.rc('axes', labelsize=16, grid=True)
mpl.rc('xtick', labelsize=14)
mpl.rc('ytick', labelsize=14)
mpl.rc('legend', fontsize=14)
# Import the data using pandas
athlete_events = pd.read_csv('../Dataset/athlete_events.csv')
noc_data = pd.read_csv('../Dataset/noc_regions.csv')
#Combine the datasets
pysqldf = lambda q:sqldf(q, globals())
player_stats = pysqldf('SELECT ath.*, noc.region AS Region\
FROM athlete_events AS ath \
LEFT OUTER JOIN noc_data AS noc \
ON ath.NOC=noc.NOC \
ORDER BY noc.NOC')
player_stats.head(10)
To understand the gender disparity in Olympics, let's start by checking total number of male and female players who have participated in olympics. This will be followed by looking at the regional distribution of male and female players in every region over all the years.
Let M be number of male players and F be number of female players. Then, for any region A, we define gender disparity as
$$Disparity = M - F$$ $$Disparity (\%) = (M - F)/(M + F) * 100.$$
Let's first look at total male vs female players during the entire Olympics
#SQL
male_female_num = pysqldf("SELECT Sex, COUNT(DISTINCT ID) FROM player_stats GROUP BY Sex")
print(male_female_num.head(30))
#Pandas
total_players = np.size(player_stats.ID.unique())
male_players = np.size(player_stats[player_stats.Sex=='M'].ID.unique())
female_players = np.size(player_stats[player_stats.Sex=='F'].ID.unique())
male_players_ID = player_stats[player_stats.Sex=='M'].ID.unique()
assert total_players == male_players+female_players, "Total players !=male players + female players. Please check your query."
print("Male Players = {}, Female Players = {}, % disparity = {:0.2g}".format(male_players, female_players, (male_players-female_players)/total_players*100.))
Such differences can arise due to various reasons mostly specific to individual countries and their policities. To get the better understanding of the cause, let's take a look at regional disparity in the world.
#Query total players from different regions based on their sex - Include datapoints for all the years
male_reg_stats = pysqldf("SELECT Region, COUNT(DISTINCT ID) AS NumMalePlayers \
FROM player_stats \
WHERE NOT(Region ISNULL) AND SEX='M'\
GROUP BY Region ")
female_reg_stats = pysqldf("SELECT Region, COUNT(DISTINCT ID) AS NumFemalePlayers \
FROM player_stats \
WHERE NOT(Region ISNULL) AND SEX='F'\
GROUP BY Sex, Region ")
mf_reg_stats = pysqldf("SELECT ml.Region AS Region, ml.NumMalePlayers AS NumMalePlayers, \
fm.NumFemalePlayers AS NuMFemalePlayers, \
(ml.NumMalePlayers - fm.NumFemalePlayers) AS Disparity,\
(ml.NumMalePlayers - fm.NumFemalePlayers)*100/(ml.NumMalePlayers + fm.NumFemalePlayers) AS Disparity_pct\
FROM male_reg_stats AS ml LEFT JOIN female_reg_stats AS fm\
ON ml.Region=fm.Region\
ORDER BY Region ")
#Look at countries where more than 75% of players are males (can be changed here)
disparity_pct_cutoff = 75.0
print("Total countriess with disparity > {} = {}".format(disparity_pct_cutoff,mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Region.unique().size ))
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(12,12), sharex=True)
ax1.scatter(mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Region,\
mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Disparity, marker='x')
#ax1.set_xlabel('Region')
ax1.set_ylabel('#Males - #Females')
ax1.tick_params(labelrotation=90)
ax1.ticklabel_format(axis="y", style="sci", scilimits=(0,0))
ax2.scatter(mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Region,\
mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Disparity_pct, marker='x')
ax2.set_xlabel('Region')
ax2.set_ylabel('% Difference')
ax2.tick_params(labelrotation=90)
plt.tight_layout()
plt.show()
plt.close()
Analysis - From the plot above, we see that the gender inequality in sports is significant across the world with 23 countries having gender disparity > 75%. As our initial hypothesis, most of these countries are developing nations, which have been known to have restrictive laws towards woman rights and freedom compared to developed nations.
Let's see how this looks on a world map to get a better picture of disparity distribution across the world. This would make it easier to relate with development index, gender inequality and population dynamics etc.
import plotly
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
mapdata = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')
#Let us check if there are any missing countries/ mismatching data
#print(mapdata.COUNTRY.unique())
print("Following country names not found in the \'2014_world_gdp_with_codes\' dataset")
for region in mf_reg_stats.Region.unique():
country_found=False
for country in mapdata.COUNTRY.unique():
if region in country:
country_found=True
if country_found==False:
print("{} ".format(region))
A quick look in the dataset shows that most of these regions have slightly different names. Let's modify out database to correct for these regions.
mf_reg_stats_corrected = pysqldf("SELECT *, \
(CASE WHEN Region='Boliva' THEN 'Bolivia'\
WHEN Region='Cape Verde' THEN 'Cabo Verde'\
WHEN Region='Democratic Republic of the Congo' THEN 'Congo, Democratic Republic of the'\
WHEN Region='Ivory Coast' THEN 'Cote d''Ivoire'\
WHEN Region='Myanmar' THEN 'Burma'\
WHEN Region='North Korea' THEN 'Korea, North'\
WHEN Region='Republic of Congo' THEN 'Congo, Republic of the'\
WHEN Region='South Korea' THEN 'Korea, South' \
WHEN Region='UK' THEN 'United Kingdom'\
WHEN Region='USA' THEN 'United States'\
WHEN Region='Virgin Islands, British' THEN 'British Virgin Islands'\
WHEN Region='Virgin Islands, US' THEN 'Virgin Islands' \
ELSE Region END) AS Country\
FROM mf_reg_stats ")
#print(sorted(mf_reg_stats_corrected.Country.unique()))
#px.data.gapminder().query("year==2007")
df = pysqldf("SELECT mapdata.*, COALESCE(mf_reg_stats_corrected.Disparity_pct,0) AS Disparity_pct FROM mapdata LEFT JOIN mf_reg_stats_corrected ON mapdata.COUNTRY=mf_reg_stats_corrected.Country")
figdir = '/Users/bhaveshkhamesra/Documents/GeorgiaTech/CourseEra/SQL_Course4/Presentations'
fig = go.Figure(data=go.Choropleth(
locations = df['CODE'],
z = df['Disparity_pct'],
text = df['COUNTRY'],
colorscale = 'Inferno',
autocolorscale=False,
reversescale=True,
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar_ticksuffix = '%',
colorbar_title = 'Disparity(%)<br>',
))
fig.update_layout(
title_text='Olympic Players - Regional Disparity',
geo=dict(
showframe=False,
showcoastlines=False,
projection_type='equirectangular'
),
)
fig.show()
#plotly.io.orca.config.executable = '/Users/bhaveshkhamesra/Softwares/anaconda2/envs/sparkpy/'
#fig.write_image(os.path.join(figdir,'WorldMap1.png'), height=800, width=1200, scale=10)
From the above distribution, most of the countries in the world have disparity > 30%. Few countries like China and Bhutan have done exceptionally well with almost same or more representation of female players compared with male players. One the other hand, even most of the developed nations have struggled with this challenge with disparity between 30% (like Canada, Australia) all the way to 70% (Portugal, Finland, Belgium, Luxemburg). Most of the developing countries have pretty high disparity percent generally greater than >50% including several Asian, African and South American countries. Many third world countries in Asia, Africa and South America also have significantly high disparity (>70%).
One can focus on nations with more than 90% disparity in participation. Most of these countries are Islamic nations like Afghanistan, Pakistan, Iran, Iraq, Qutar, Saudi Arabia and Seria where more than 90% of sports players are men. This isn't unexpected as some of these countries have been infamous for restrictive laws and policies towards woman's rights and freedom. Some of these countries are also economically underdeveloped which can also be an important factor. However, one should note that the problem is not because of religion but more about the local policies of each nation.
We note that here we have included the entire dataset from all the years since 1896. To get a more recent picture, it will be better to look at more recent dataset. Also, it will be interesting to see how country's economic growth and development index is related with the disparity factor.
#Query total players from different regions based on their sex - Include datapoints for all the years
male_reg_stats_1980 = pysqldf("SELECT Region, COUNT(DISTINCT ID) AS NumMalePlayers \
FROM player_stats \
WHERE NOT(Region ISNULL) AND SEX='M' AND Year>1980\
GROUP BY Region ")
female_reg_stats_1980 = pysqldf("SELECT Region, COUNT(DISTINCT ID) AS NumFemalePlayers \
FROM player_stats \
WHERE NOT(Region ISNULL) AND SEX='F' AND Year>1980\
GROUP BY Sex, Region ")
mf_reg_stats_1980 = pysqldf("SELECT ml.Region AS Region, ml.NumMalePlayers AS NumMalePlayers, \
fm.NumFemalePlayers AS NuMFemalePlayers, \
(ml.NumMalePlayers - fm.NumFemalePlayers) AS Disparity,\
(ml.NumMalePlayers - fm.NumFemalePlayers)*100/(ml.NumMalePlayers + fm.NumFemalePlayers) AS Disparity_pct\
FROM male_reg_stats_1980 AS ml LEFT JOIN female_reg_stats_1980 AS fm\
ON ml.Region=fm.Region\
ORDER BY Region ")
mf_reg_stats_corrected = pysqldf("SELECT *, \
(CASE WHEN Region='Boliva' THEN 'Bolivia'\
WHEN Region='Cape Verde' THEN 'Cabo Verde'\
WHEN Region='Democratic Republic of the Congo' THEN 'Congo, Democratic Republic of the'\
WHEN Region='Ivory Coast' THEN 'Cote d''Ivoire'\
WHEN Region='Myanmar' THEN 'Burma'\
WHEN Region='North Korea' THEN 'Korea, North'\
WHEN Region='Republic of Congo' THEN 'Congo, Republic of the'\
WHEN Region='South Korea' THEN 'Korea, South' \
WHEN Region='UK' THEN 'United Kingdom'\
WHEN Region='USA' THEN 'United States'\
WHEN Region='Virgin Islands, British' THEN 'British Virgin Islands'\
WHEN Region='Virgin Islands, US' THEN 'Virgin Islands' \
ELSE Region END) AS Country\
FROM mf_reg_stats_1980 ")
#print(sorted(mf_reg_stats_corrected.Country.unique()))
#Look at countries where more than 75% of players are males (can be changed here)
disparity_pct_cutoff = 75.0
#px.data.gapminder().query("year==2007")
df = pysqldf("SELECT mapdata.*, COALESCE(mf_reg_stats_corrected.Disparity_pct,0) AS Disparity_pct FROM mapdata LEFT JOIN mf_reg_stats_corrected ON mapdata.COUNTRY=mf_reg_stats_corrected.Country")
fig = go.Figure(data=go.Choropleth(
locations = df['CODE'],
z = df['Disparity_pct'],
text = df['COUNTRY'],
colorscale = 'Inferno',
autocolorscale=False,
reversescale=True,
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar_ticksuffix = '%',
colorbar_title = 'Disparity(%)<br>',
))
fig.update_layout(
title_text='Olympic Players - Regional Disparity (since 1980)',
geo=dict(
showframe=False,
showcoastlines=False,
projection_type='equirectangular'
),
)
fig.show()
print("Total countriess with disparity > {} = {}".format(disparity_pct_cutoff,mf_reg_stats_corrected[mf_reg_stats_corrected.Disparity_pct>disparity_pct_cutoff].Region.unique().size ))
fig, (ax1) = plt.subplots(1,1, figsize=(12,7), sharex=True)
ax1.scatter(mf_reg_stats_1980[mf_reg_stats_1980.Disparity_pct>disparity_pct_cutoff].Region,\
mf_reg_stats_1980[mf_reg_stats_1980.Disparity_pct>disparity_pct_cutoff].Disparity_pct, marker='x', label='Year = 1980-2016')
ax1.scatter(mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Region,\
mf_reg_stats[mf_reg_stats.Disparity_pct>disparity_pct_cutoff].Disparity_pct, marker='o', label='Year = 1896-2016')
ax1.set_xlabel('Region')
ax1.set_ylabel('% Difference')
ax1.tick_params(labelrotation=90)
plt.legend()
plt.tight_layout()
plt.show()
plt.close()
As can be seen from above plots, the situation around the world has considerably improved. For most developed countries and developing countries, the disparity is less than 40%. Countries like China and Burma has reversed the trend with more female players than male players participating in the Olympics. The situation has also improved in many Islamic nations with disparity reducing to less than 70% though for some countries like Iran, Iraq, Saudi Arabia etc, even in the modern times, the situation hasn't changed much which can be concerning issue. The above plot shows countries with more htan 75% disparity and is predominantly occupied by Islamic nations. To see if the disparity has actually changed with time, let's look at disparity as function of time (for all regions combined).
male_year_stats = pysqldf("SELECT Year, COUNT(DISTINCT ID) AS NumMalePlayers \
FROM player_stats \
WHERE SEX='M' \
GROUP BY Year ")
female_year_stats = pysqldf("SELECT Year, COUNT(DISTINCT ID) AS NumFemalePlayers \
FROM player_stats \
WHERE SEX='F' \
GROUP BY Year ")
mf_year_stats = pysqldf("SELECT ml.Year AS Year, COALESCE(ml.NumMalePlayers,0) AS NumMalePlayers, \
COALESCE(fm.NumFemalePlayers,0) AS NuMFemalePlayers, \
(COALESCE(ml.NumMalePlayers,0) - COALESCE(fm.NumFemalePlayers,0)) AS Disparity,\
(COALESCE(ml.NumMalePlayers,0) - COALESCE(fm.NumFemalePlayers,0))*100/\
(COALESCE(ml.NumMalePlayers,0) + COALESCE(fm.NumFemalePlayers,0)) AS Disparity_pct\
FROM male_year_stats AS ml LEFT JOIN female_year_stats AS fm\
ON ml.Year=fm.Year\
ORDER BY Year ")
#print(mf_year_stats.head(10))
disparity_fit = np.polyfit(mf_year_stats.Year, mf_year_stats.Disparity_pct, 2)
correlation = np.corrcoef(mf_year_stats.Year, mf_year_stats.Disparity_pct)[0,1]
print("Correlation between year and % disparity = {}".format(correlation))
plt.figure(figsize=(12,6))
plt.plot(mf_year_stats.Year, mf_year_stats.Disparity_pct)
plt.plot()
plt.xlabel('Year')
plt.ylabel('% Disparity ')
plt.show()
plt.close()
#print(mf_year_stats.head(10))
disparity_fit = np.poly1d(np.polyfit(mf_year_stats.Year, mf_year_stats.Disparity_pct, 2))
correlation = np.corrcoef(mf_year_stats.Year, mf_year_stats.Disparity_pct)[0,1]
year_predict = np.arange(1896, 2028, 4)
print("Correlation between year and % disparity = {}".format(correlation))
plt.figure(figsize=(8,8))
plt.plot(mf_year_stats.Year, mf_year_stats.Disparity_pct, label='Data')
plt.plot(year_predict,disparity_fit(year_predict), label='Fit')
plt.plot()
plt.xlabel('Year')
plt.ylabel('% Disparity ')
plt.legend()
plt.show()
plt.close()
Here we have a plot showing how the gender disparity in Olympics has changed over the years. The strong negative correlation shows the participation of woman has improved significantly since 1896 with current disparity about 15%.
To analyse this, let's look at the average age of players as function of sports they play.
# Query to find the average age for each sportss
age_stats = pysqldf("SELECT Sport, AVG(Age) AS Avg_Age \
FROM \
(SELECT ID, Sport, Age \
FROM player_stats \
WHERE NOT(Age ISNULL) \
GROUP BY ID, Age )\
GROUP BY Sport")
#print(age_stats)
#Let's compute the basic stats - Mean, median and standard deviation
mean_age = age_stats.Avg_Age.mean()
std_dev = age_stats.Avg_Age.std()
median_age = age_stats.Avg_Age.median()
print("Median age for all players = {} \nMean age of all players = {} \nStandard Deviation = {}"\
.format(age_stats.Avg_Age.median(),age_stats.Avg_Age.mean(), age_stats.Avg_Age.std()))
#Let's plot the data
plt.figure(figsize=(15,6))
plt.scatter(np.arange(len(age_stats.Sport)), age_stats.Avg_Age)#, align='center', alpha=0.5)
plt.bar(np.arange(len(age_stats.Sport)), age_stats.Avg_Age, align='center', alpha=0.5)
plt.xticks(np.arange(len(age_stats.Sport)), age_stats.Sport)
#plt.axhline(y=25)
plt.axhline(y=mean_age+std_dev, ls='--', c='r',label="Mean+Std Dev")
plt.axhline(y=median_age, label='Median Age')
plt.axhline(y=mean_age-std_dev, ls='--', c='r',label="Mean-Std Dev")
plt.xlabel('Sport Type')
plt.ylabel('Average Age')
plt.tick_params(labelrotation=90)
plt.legend()
plt.show()
plt.close()
age_stats.hist(bins=40)
Analysis - The plot above shows that for most sports, the average age falls in the bracket of $27.2 \pm 5.3$ years. Hence, there doesn't seem any strong dependence between the average and sport type. There are some clear outliers such as Alpinism, Art Competitions, Equestrianism, Motorboating, Polo, Rhythmic Gymnastics, Roque and Swimming. For Alpinism, Art Competitions, Equestrianism, Motorboating, Polo, and Roque, the average age is significantly higher than average which suggests that experience is more relevant in these sports. On the other hand, for swimming and gymnastics, agility is favoured over experience.
Let's compare how this distribution and the conclusion change when we consider only the top contenders from each sport event. Note that, here we assume that the average age of players would not be significantly affected over the years and ignore the variations of mean age over time for each sport.
# Query to find the average age for each sportss
age_stats_topplayers = pysqldf("SELECT Sport, AVG(Age) AS Avg_Age \
FROM \
(SELECT ID, Sport, Age \
FROM player_stats \
WHERE NOT(Age ISNULL) AND NOT(Medal ISNULL)\
GROUP BY ID, Age )\
GROUP BY Sport")
#print(age_stats_topplayers)
#Let's compute the basic stats - Mean, median and standard deviation
mean_age_tp = age_stats_topplayers.Avg_Age.mean()
std_dev_tp = age_stats_topplayers.Avg_Age.std()
median_age_tp = age_stats_topplayers.Avg_Age.median()
print("Median age for top players = {}, Median age for top players = {}".format(median_age, median_age_tp))
print("Mean age of all players = {}, Mean age of top players = {}".format(mean_age, mean_age_tp))
print("Standard Deviation for all players = {}, Standard Deviation for top players= {}".format(std_dev, std_dev_tp))
#Let's plot the data
plt.figure(figsize=(15,6))
plt.scatter(np.arange(len(age_stats_topplayers.Sport)), age_stats_topplayers.Avg_Age)#, align='center', alpha=0.5)
plt.bar(np.arange(len(age_stats_topplayers.Sport)), age_stats_topplayers.Avg_Age, align='center', alpha=0.5)
plt.xticks(np.arange(len(age_stats_topplayers.Sport)), age_stats_topplayers.Sport)
#plt.axhline(y=25)
plt.axhline(y=mean_age+std_dev, ls='--', c='r',label="Mean+Std Dev")
plt.axhline(y=median_age, label='Median Age')
plt.axhline(y=mean_age-std_dev, ls='--', c='r',label="Mean-Std Dev")
plt.xlabel('Sport Type')
plt.ylabel('Average Age')
plt.tick_params(labelrotation=90)
plt.legend()
plt.show()
plt.close()
As we see here, the results don't change significantly with only minors differences in mean age and standard deviation. Thus we can conclude that average age of most sports do not vary significantly and lies with 1-sigma of age distribution with average age of ~$26$ years with standard deviation of $\pm 5$ years.
Here we want to look at the performance of countries in the olympics and wish to see if certain countries have always dominated majority of olympic events over the years or if there is uniform variation across the sport events and leading countries have changed over the years.
To do this, let's first look at the top ten leading countries in total events tally for all the olympic events. We can also rank them based on event tally for each game and look at countries which have consistently finished in top ten. Here we use both these statistics and compare them at the end.
#Let's remove the records with no regions
games_with_only_medals = pysqldf("SELECT ID, Games, Year, Event, Sport, Region, COALESCE(Medal, 'None') AS Medal\
FROM player_stats \
WHERE NOT(Region ISNULL)")
games_with_only_medals.info()
#Now let's look at total number of events played during these games
total_events = pysqldf("SELECT Games, Year, Event\
FROM games_with_only_medals\
GROUP BY Games, Year, Event")
#print(total_events.Event.unique())
num_total_events = total_events.Event.size
print("Total number of events played in all the games = {}\n".format(total_events.Event.size))
print("It might seem peculiar that although number of events are {}, but number of medals are more then 3 times \
of this. However, recall that in Olympics, apart from individual events, there are also team events where instead \
of 3, there can be more than 45 medals awarded per such event. For example, In Rugby Men's Team event for year \
1900 had 47 medals awarded in total (Check the stats below)".format(total_events.Event.size))
#Example - Rugby Men's Team event for year 1900 had 47 medals awarded in total
pd.set_option('display.max_colwidth', 100)
rugby_medals = pysqldf("SELECT COUNT(Medal) AS 'Total Medals Awarded'\
FROM (\
SELECT Games, Year, Event, Region, ID, Medal\
FROM games_with_only_medals\
WHERE Event LIKE '%Rugby Men%'\
AND Year=1900\
GROUP BY Games, Year, Event, Medal, ID)\
")
print("Total medals awarded in Rugby Men's Team event in 1900 = {}".format(int(rugby_medals["Total Medals Awarded"])))
#Now let's look at top 40 countries who won maximum number of events.
total_events_per_reg = pysqldf("SELECT Region, Count(Event_Won) AS Total_Events_Won,\
Count(Event_Won)*100./{} AS Event_Won_Pct\
FROM(\
SELECT Games, Year, Region, Event AS Event_Won\
FROM games_with_only_medals\
WHERE NOT(Medal LIKE 'None')\
GROUP BY Games, Year, Event, Region\
)\
GROUP BY Region ORDER BY Count(Event_Won) DESC LIMIT 20\
".format(num_total_events))
print("Top 5 countries which won highest number of events - ")
print(total_events_per_reg.head(5))
bar = total_events_per_reg.plot.bar(x='Region', y='Total_Events_Won', rot=90, figsize=(10,8))
#game_hist.set_ylabel('Number of Medals')
From the above plot, USA, Russsia and Germany clearly standout winning about 34%, 26% and 24% of the total events ever occured in Olympics. This suggest that few countries have clearly dominated the Olympic games. Let's rank the countries based on their event tally for each game and look at the countries which have appeared maximum number of times in top ten places.
#Top ten countries who won maximum number of events per game
event_stats_per_game = pysqldf("SELECT *\
FROM \
(SELECT Year, Region, Count(Event) as EventCount,\
RANK() OVER (PARTITION BY Year ORDER BY COUNT(Event) DESC) AS Ranking \
FROM (\
SELECT Year, Region, Event\
FROM player_stats \
WHERE NOT(Medal ISNULL) \
GROUP BY Year, Region, Event \
ORDER BY Year ASC, Event DESC\
)\
GROUP BY Year, Region\
) \
WHERE ranking<10")
#Let's now check countries who have consistently performed better than others
top_performers_per_game = pysqldf("SELECT Region, Count(Ranking) AS Num_Times_In_Top_Ten\
FROM event_stats_per_game \
GROUP BY Region\
ORDER BY COUNT(Ranking) DESC LIMIT 20")
print("Total games over years = {}".format(player_stats.Year.unique().size))
print(top_performers_per_game.head(5))
bar = top_performers_per_game.plot.bar(x='Region', y='Num_Times_In_Top_Ten', rot=90, figsize=(10,8))
#game_hist.set_ylabel('Number of Medals')
The above plot shows list of countries which have consistently performed in Olympics. x axis lists the name of countries and y axis describes total number of Olympin games in which each country has reached to one of the top ten spots based on number of events won (not number of medals). Here, I have used "total events won" count as ranking criteria since in case of group events, one event would contribute the several medals and hence, medal count can lead to biased picture.
This is consistent with our previous plot of total events won by countries over entire Olympic history with some changes in country rankings. We see again that USA, Germany, France and UK continue to be in top five leading nations. USA has been most consistent with managing a spot in top ten positions in 34 out of 35 games followed by Germany, France, Itals and UK at 32, 27, 36 and 23. While Russia has won more number of events than Italy, it has not appeared in top ten positions as much as Italy.
The above two measures show that few countries have clearly dominated the Olympics over all the years. Let's look at the performance of top five countries over time by comparing the number of events won by them in each game.
usa_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='USA' ORDER BY Year")
germany_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='Germany' ORDER BY Year")
france_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='France' ORDER BY Year")
uk_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='UK' ORDER BY Year")
italy_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='Italy' ORDER BY Year")
russia_stats = pysqldf("SELECT * FROM event_stats_per_game WHERE Region='Russia' ORDER BY Year")
total_events = pysqldf("SELECT Year, SUM(EventCount) AS TotalEvents FROM event_stats_per_game GROUP BY Year ORDER BY Year")
plt.figure(figsize=(15,6))
plt.plot(total_events.Year, total_events.TotalEvents, label='Total Events')
plt.plot(usa_stats.Year, usa_stats.EventCount, label='USA')
plt.plot(germany_stats.Year, germany_stats.EventCount, ls='--', label='Germany')
plt.plot(france_stats.Year, france_stats.EventCount, ls=':',label='France')
plt.plot(uk_stats.Year, uk_stats.EventCount, ls='-.',label='UK')
plt.plot(italy_stats.Year, italy_stats.EventCount, ls='--',label='Italy')
plt.plot(russia_stats.Year, russia_stats.EventCount, ls='-',label='Russia')
plt.xlabel('Year')
plt.ylabel('Events Won')
plt.xlim
plt.legend()
plt.show()
plt.close()
Above plot shows the number of events won by top performers - USA, Germany, France, UK, Italy and Russia for each game. The performance of USA had been improving remaining the best out of the six nations. Similar results hold for Germany, though after 1990s its performance has been decreasing. UK, France and Italy have almost consistent performance with no significant change till 1990s. UK has since started improving. Even though Russia has won a lot of events, it ranked low in number of times it rose in top ten. This is evident here when we see that Russia's participation in Olympics is mainly recorded after 1952. A quick online search shows that Russia did not participate in Olympics between 1912-1952 which explains the missing data.
To Summarize -
It will be interesting to dig more into these issues. More specifically has gender disparity changed in recent times? Does the relation between age and sports continues to hold when we consider only top contenders three players from each event. How has the nation tally changed in the recent years?